
[Cosmos] Hybrid Search query pipeline #38275

Open · wants to merge 34 commits into base: main

Conversation

@simorenoh (Member) commented Nov 1, 2024:

Adds support for performing full text search queries through the introduction of the hybrid search query pipeline. This consists of the newly added hybrid_search_aggregator, which performs the necessary query steps to obtain the needed results.

With these changes, the SDK can now interpret queries utilizing key functions like FullTextContains(), FullTextContainsAll(), FullTextContainsAny(), Order By Rank <FullTextFunction>(), and Order By Rank RRF().

The design doc for the implementation can be found here: Hybrid Search Doc.
The new README in this PR also has additional information.
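For illustration only, a minimal sketch of how such queries might be issued through the SDK once this pipeline is in place. The account endpoint, key, container, and property names are hypothetical, and the exact full text function signatures (e.g. the argument shape of FullTextScore) may differ by service version:

```python
from azure.cosmos import CosmosClient

# Hypothetical account/container names, for illustration only.
client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("sample_db").get_container_client("fts_container")

# Plain full text filter.
contains_query = "SELECT TOP 10 c.id FROM c WHERE FullTextContains(c.text, 'quantum')"

# Hybrid search: re-rank results with Reciprocal Rank Fusion over two full text scores.
rrf_query = (
    "SELECT TOP 10 c.id FROM c "
    "ORDER BY RANK RRF(FullTextScore(c.text, ['quantum']), FullTextScore(c.title, ['computing']))"
)

for query in (contains_query, rrf_query):
    for item in container.query_items(query=query, enable_cross_partition_query=True):
        print(item["id"])
```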

Still missing in this PR at the moment:

  • Samples for both sync and async
  • Tests for both sync and async
  • Additional README information on how to run these queries.
  • Cleaning up/aggregating some of the repeated logic.
  • Possible optimizations to query plan fetching - will probably be addressed in a separate PR. Issue: [Cosmos] make queries fetch query plan in every query #38577

@github-actions github-actions bot added the Cosmos label Nov 1, 2024
@azure-sdk (Collaborator)

API change check

API changes are not detected in this pull request.

@simorenoh simorenoh marked this pull request as ready for review November 5, 2024 21:25
@simorenoh simorenoh requested review from annatisch and a team as code owners November 5, 2024 21:25
sdk/cosmos/azure-cosmos/README.md (two outdated review threads, resolved)
@@ -173,7 +173,7 @@ def __init__(self, execution_context, aggregate_operators):
         for operator in aggregate_operators:
             if operator == "Average":
                 self._local_aggregators.append(_AverageAggregator())
-            elif operator == "Count":
+            elif operator in ("Count", "CountIf"):
Review comment (Member):
There are no changes to _CountAggregator in this PR - how is this supposed to start supporting CountIf?
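For context, a hedged sketch of what a count-style aggregator in this pipeline typically looks like (not the SDK's actual implementation): it only sums numeric partial results, so reusing it for CountIf would rely on the backend already returning the conditional count as a plain number per partition.

```python
class _CountAggregatorSketch:
    """Illustrative count-style aggregator: sums numeric partial counts.

    Mapping CountIf onto the same class assumes each partition's partial
    result for CountIf is already a plain number; if the payload shape
    differs, a dedicated aggregator would be required.
    """

    def __init__(self):
        self.count = 0

    def aggregate(self, other):
        # 'other' is the partial count produced by one partition's query.
        self.count += other

    def get_result(self):
        return self.count
```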

@@ -187,7 +187,7 @@ def __init__(self, execution_context, aggregate_operators):
         for operator in aggregate_operators:
             if operator == "Average":
                 self._local_aggregators.append(_AverageAggregator())
-            elif operator == "Count":
+            elif operator in ("Count", "CountIf"):
Review comment (Member):
No changes to CountAggregator - how would this support the new aggregate?

@@ -136,6 +145,22 @@ def _create_pipelined_execution_context(self, query_execution_info):
                 self._query,
                 self._options,
                 query_execution_info)
+        elif query_execution_info.has_hybrid_search_query_info():
Review comment (Member):
Can the exception factory be refactored so it can be shared across sync and async?
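If the comment refers to the error construction inside the new hybrid search branch, one possible shape (hypothetical module and function names, not the SDK's actual layout) is a plain synchronous helper that both the sync and async pipelined execution contexts import, so the message and status code are defined once:

```python
# _hybrid_search_errors.py (hypothetical shared module; it contains no awaits,
# so it can be imported from both the sync and async execution contexts)
from azure.cosmos.exceptions import CosmosHttpResponseError


def unsupported_hybrid_search_query(reason: str) -> CosmosHttpResponseError:
    """Builds the exception raised when a hybrid search query cannot be executed."""
    return CosmosHttpResponseError(
        status_code=400,
        message=f"Hybrid search query is not supported: {reason}",
    )
```

Both `_create_pipelined_execution_context` implementations could then raise the helper's result instead of duplicating the construction.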

        self.test_db = self.client.create_database(str(uuid.uuid4()))
        self.test_container = self.test_db.create_container(
            id="FTS" + self.TEST_CONTAINER_ID,
            partition_key=PartitionKey(path="/id"),
@FabianMeiswinkel (Member) commented Nov 15, 2024:
Same comment I left for Java

I think the structure of the docs is not very helpful - you would need test cases that target individual physical partitions (both via PK in the query request options and via the query plan through a WHERE condition) as well as real cross-partition queries. Technically you could still do it with id, but the additional full-text / hybrid search becomes useless on a single doc. So I would argue there should be a few logical partitions with 100 docs each (could be the same ids from the file), and then tests that search across partitions as well as scoped to an individual partition - ideally also with HPK. You kind of have to add all these test cases to the matrix because we can't rely on it magically working, given the extra step of getting global statistics.

I think the test matrix will need to be extended quite a bit - maybe let's jump on a call Monday morning to discuss this, and also which tests are required for merge vs. which are OK to add later.
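A minimal sketch of the kind of test layout this suggests: a few logical partitions with ~100 docs each, then the same hybrid query run once cross-partition and once scoped to a single partition. Container, property, and query contents are illustrative, the full text indexing/policy configuration the container would actually need is omitted, and a hierarchical-partition-key (HPK) variant would be an additional case on top of this.

```python
import uuid
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account credentials, for illustration only.
client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database(str(uuid.uuid4()))

# Dedicated partition key path instead of /id, so each logical partition holds many docs.
# (Full text indexing policy omitted here; the real test container would need it.)
container = db.create_container(id="fts_multi_pk", partition_key=PartitionKey(path="/pk"))

for pk in ("pk-0", "pk-1", "pk-2"):
    for i in range(100):
        container.create_item({"id": f"{pk}-{i}", "pk": pk, "text": f"sample text {i}"})

query = "SELECT TOP 10 c.id FROM c ORDER BY RANK FullTextScore(c.text, ['sample'])"

# Cross-partition run: exercises the global statistics step across physical partitions.
cross_partition = list(container.query_items(query=query, enable_cross_partition_query=True))

# Run scoped to a single logical partition via the request options.
single_partition = list(container.query_items(query=query, partition_key="pk-1"))
```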

@FabianMeiswinkel (Member) left a review comment:
Looks good overall - main comment is about the test matrix/coverage.
